[nexus] FM alert requests and rendezvous task #9552

Open
hawkw wants to merge 22 commits into main from eliza/fm-alerts

Conversation

@hawkw hawkw (Member) commented Dec 19, 2025

This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep.

We don't want entries in the alert table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operated on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests; if that sitrep becomes current, a background fm_rendezvous task reconciles the requested alerts in the sitrep with the actual alert table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.
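The request/reconcile indirection can be sketched with in-memory maps standing in for the database tables. This is a hypothetical model, not the actual Nexus code: `AlertRequest`, `reconcile`, and the map layouts are invented for illustration, and the real fm_rendezvous task would do this with CRDB queries rather than in memory.

```rust
use std::collections::HashMap;

/// Hypothetical model of an alert request recorded by a sitrep: the sitrep
/// specifies the alert's UUID (a u64 here, standing in for a real Uuid) and
/// its payload.
struct AlertRequest {
    id: u64,
    payload: String,
}

/// Reconcile the alerts requested by the *current* sitrep into the alert
/// table. Requests belonging to sitreps that were never made current are
/// ignored, and insert-if-absent keeps repeated runs from duplicating
/// alerts, so the task is idempotent.
fn reconcile(
    current_sitrep: u64,
    requests: &HashMap<u64, Vec<AlertRequest>>, // sitrep id -> its requests
    alerts: &mut HashMap<u64, String>,          // alert id -> payload
) -> usize {
    let mut created = 0;
    for req in requests.get(&current_sitrep).into_iter().flatten() {
        if !alerts.contains_key(&req.id) {
            alerts.insert(req.id, req.payload.clone());
            created += 1;
        }
    }
    created
}
```

In this model, requests made by a sitrep that never becomes current simply never materialize as alert rows, which is the property the indirection exists to provide.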

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.

This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.

@hawkw hawkw requested review from davepacheco and smklein December 19, 2025 21:57
@hawkw hawkw self-assigned this Dec 19, 2025
@hawkw hawkw added the nexus (Related to nexus) and fault-management (Everything related to the fault-management initiative (RFD480 and others)) labels Dec 19, 2025
@hawkw hawkw changed the title [nexus] fm alert requests and rendezvous task [nexus] FM alert requests and rendezvous task Dec 19, 2025
@smklein smklein assigned smklein and unassigned hawkw Jan 5, 2026
@smklein smklein (Collaborator) left a comment

Overall structure looks good, but I've got some questions about lifetimes of things

let class = class.into();
match self
    .datastore
    .alert_create(&opctx, id, class, payload.clone())
@smklein (Collaborator):

I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.

  • Suppose we want to delete an alert record from cockroachdb
  • Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
  • If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)

I don't think this problem has been totally solved for blueprints, either - I'm not seeing such guards in reconcile_blueprint_rendezvous_tables - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.

I wrote up an issue for this on the blueprint side with #9592 , but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.
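The resurrection hazard described above can be modeled in a few lines. Everything here is hypothetical (in-memory maps instead of CRDB, and a tombstone set as just one possible shape for the guard); the point is only that an unguarded insert-if-absent, replayed by a lagging Nexus working from an old sitrep, recreates a deleted alert, while a guarded insert does not.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of the rendezvous task's insert. `deleted` stands in
/// for some durable record of deletions (e.g. tombstones) that a guarded
/// INSERT could consult; `guarded` toggles the guard so the race is visible.
fn rendezvous_insert(
    alert_id: u64,
    alerts: &mut HashMap<u64, &'static str>,
    deleted: &HashSet<u64>,
    guarded: bool,
) {
    if guarded && deleted.contains(&alert_id) {
        // The alert was deliberately deleted; do not bring it back to life.
        return;
    }
    alerts.entry(alert_id).or_insert("payload");
}
```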

@hawkw (Member, Author):

You're correct that the alert records are currently never deleted. I think we do need to figure out a strategy for that (see #8076). For the near term, I think we ought to add a nullable case_id column to the alert record, so that alerts created by FM alert requests can point back to the FM case that owns them - we might then use that to determine whether FM alerts are safely deletable, by checking whether that case is still in the current sitrep, or something. WDYT?

@smklein (Collaborator):

Makes sense to me - do we have something similar for webhook_delivery_attempts? I'm noticing that schema references the alert record, and presumably would want to stop if the case was closed?

@hawkw (Member, Author):

I don't think we would want to stop attempting delivery if a case is closed, necessarily. For instance, we might be trying to deliver an alert for the resolution of an Active Problem, which closes the case.

@hawkw (Member, Author):

I was mostly just thinking that we would avoid deleting alerts which were requested by a case that still exists in the current sitrep as a way to avoid accidentally re-creating those alerts in reconciliation.

@smklein (Collaborator):

hrmmm so in #9592 I mentioned a couple possible solutions to avoid re-creating alerts. I figured we were leaning towards option (1) - "guard the INSERT" operation - which would attempt to prevent re-creating alerts if our state is out-of-date. How would that work with a new case_id column on the alert table?

When we're doing reconciliation for a sitrep, we load cases, and generate alerts based on those cases, right? So from the point-of-view of a really slow Nexus acting on old data, it thinks there is a valid case associated with the alert, no? Or are you suggesting "only insert the alert if there exists a corresponding fm_case in the database which appears open"?

@hawkw (Member, Author):

Oh, I was imagining that the case ID column would be used to guard against deleting an alert if it was requested by a case that is still open. Sorry if that was unclear!
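A hedged sketch of that delete-side guard, with invented names: the proposed nullable case_id column is modeled as an `Option<u64>`, and in-memory collections stand in for the alert table and the set of cases in the current sitrep. The real check would of course live in a datastore query, not Nexus memory.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical delete guard: refuse to delete an alert whose owning case is
/// still present in the current sitrep. Alerts with no owning case (a NULL
/// case_id, e.g. alerts not created by FM) are not protected by this guard.
fn try_delete_alert(
    alert_id: u64,
    alerts: &mut HashMap<u64, Option<u64>>, // alert id -> owning case id
    current_sitrep_cases: &HashSet<u64>,
) -> bool {
    let owner = alerts.get(&alert_id).copied();
    match owner {
        // Owning case still in the current sitrep: deletion is unsafe,
        // because reconciliation could legitimately want this alert to exist.
        Some(Some(case_id)) if current_sitrep_cases.contains(&case_id) => false,
        // Owning case gone (or no owning case): safe to delete.
        Some(_) => {
            alerts.remove(&alert_id);
            true
        }
        // No such alert.
        None => false,
    }
}
```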

@smklein (Collaborator):

Ah, so we'd still need a guard for the INSERT in that case, right? The code as-is operates on a sitrep that has been loaded into a particular Nexus's memory - so it might actually be creating an alert for a fully closed case, unless we have some way to stop it in create_requested_alerts.

@hawkw (Member, Author):

Yeah, that's correct. I still need to work out how I want to do that.
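One hypothetical shape for that INSERT guard, again with in-memory stand-ins and invented names: only create the alert if its owning case still appears open at insert time. In the real system this check would need to happen transactionally in the database (e.g. as a conditional insert), not against a Nexus's possibly stale in-memory sitrep, which is exactly the open question above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical INSERT guard: a Nexus acting on a stale sitrep still ends
/// up consulting the *current* set of open cases at insert time, so it
/// cannot create an alert for a fully closed case.
fn create_requested_alert(
    alert_id: u64,
    case_id: u64,
    open_cases: &HashSet<u64>,      // cases open in the database *now*
    alerts: &mut HashMap<u64, u64>, // alert id -> owning case id
) -> bool {
    if !open_cases.contains(&case_id) {
        // The owning case closed since this sitrep was loaded; skip it.
        return false;
    }
    alerts.entry(alert_id).or_insert(case_id);
    true
}
```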

@smklein smklein assigned hawkw and unassigned smklein Jan 5, 2026
Co-authored-by: Sean Klein <sean@oxide.computer>
@smklein smklein (Collaborator) left a comment

LGTM modulo the question about alerts - I want to be cautious about merging and adding alerts without having a known back-stop for them. If we don't think we're going to be generating alerts immediately after this is merged, we can be more lenient - I just want to make sure we either fix it or keep an eye on it.


Labels

fault-management (Everything related to the fault-management initiative (RFD480 and others)), nexus (Related to nexus)
